feat(crawler): queue-backed manager revalidation fan-out#4210
Merged
Conversation
Closes #4200 item 2. When a manager rotates its adagents.json, every publisher delegating via ads.txt MANAGERDOMAIN needs re-validation so its authorized_agents view stays in sync. Inline fan-out at managed- network scale would saturate crawler concurrency, so this PR adds a persistent queue and a bounded worker tick. - Migration 471: manager_revalidation_queue table mirroring the shape of catalog_crawl_queue (idempotent insert, next_attempt_after for backoff, partial index for the worker's due-row scan, second index on manager_domain for ops-side per-manager status). - cacheAdagentsManifest reads the previously-cached body before the upsert and compares the contributory subset (authorized_agents, properties) via recursive stable-key canonicalization. Only actual content drift triggers the fan-out — $schema / last_updated / trailing-comment noise is intentionally ignored. - enqueueManagerRevalidation walks publishers WHERE manager_domain = $1 via the partial index added in #4204, so a Raptive-scale rotation enumerates 6K delegating publishers via an index-only scan. Insert is idempotent and resets attempts/backoff on re-enqueue so a fresh manager change supersedes any in-flight retry window. - New crawler tick processManagerRevalidationQueue drains up to 50 rows per 5-minute interval at concurrency 10. Success deletes the row; failure advances exponential backoff (1h / 6h / 1d / 3d) and stores last_error truncated to 500 chars. At those bounds, a 6K- publisher manager rotation propagates within ~10 hours, comfortably ahead of the 60-minute organic re-crawl cadence for any single row. - Tests: integration coverage for queue idempotency, due-row filtering, oldest-first ordering, success deletion, and exponential backoff. Unit coverage for the change-detection helper, including order-sensitivity on authorized_agents and ignored noise fields.
| * insert, exponential backoff on failure, deletion on success. | ||
| */ | ||
| import { describe, it, expect, beforeAll, beforeEach, afterAll } from 'vitest'; | ||
| import { initializeDatabase, closeDatabase, query } from '../../src/db/client.js'; |
Three follow-ups from code review of #4210: 1. Wire startPeriodicManagerRevalidation(5) into http.ts bootstrap next to startPeriodicCatalogCrawl. Without this the worker tick never fires and the queue fills but never drains. Critical fix. 2. Tighten the in-line comment in cacheAdagentsManifest to call out that the enqueue is intentionally outside the upsert transaction: on enqueue failure the cache write is already committed, but the next 60-min routine crawl re-detects drift and re-enqueues, so silent fan-out loss self-heals. 3. Clarify enqueueManagerRevalidation's docstring — the returned count is rows touched (inserts + ON CONFLICT updates), since superseding a stale row counts toward the load-bearing semantic of "this fresh manager change overrides any in-flight backoff."
bokelley
added a commit
that referenced
this pull request
May 8, 2026
Closes #4200 item 5. New POST /api/registry/manager-revalidation-request short-circuits the 60-minute organic crawl cycle: when a manager rotates its adagents.json, ops can hit this endpoint and have every delegating publisher enqueued for immediate re-validation. Thin wrapper around enqueueManagerRevalidation (#4210). Body: { manager_domain }. Returns 202 with publishers_enqueued. Rate-limited via the shared validateAndRateLimitCrawl machinery; key namespaced (manager: prefix) so a manager request doesn't bypass an in-window publisher recrawl on the same domain. Per-agent source enum extension to 'adagents_json_via_manager' is closed as won't-fix on #4200: publishers.discovery_method (#4204) already lets consumers join through and discriminate, and a separate per-agent value would silently exclude managerdomain rows from existing readers filtering on source='adagents_json'. Refs #4200, #4173, #4204, #4210.
This was referenced May 8, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #4200 item 2. When a manager rotates its `adagents.json`, every publisher delegating via ads.txt `MANAGERDOMAIN` needs re-validation so its `authorized_agents` view stays in sync. Inline fan-out at managed-network scale (Raptive ≈ 6K publishers) would saturate crawler concurrency, so this PR adds a persistent queue and a bounded worker tick.
What lands
At those bounds, a 6K-publisher manager rotation propagates within ~10 hours — comfortably ahead of the 60-minute organic re-crawl cadence that any single row would catch on its own.
Tests
Out of scope
Refs #4200, #4173, #4204.